Don’t judge a book or health app by its cover: User ratings and downloads are not linked to quality

Objective To analyse the relationship of health app quality with user ratings and the number of downloads of the corresponding health apps. Materials and methods Utilising a dataset of 881 Android-based health apps, assessed via the 300-point objective Organisation for the Review of Care and Health Applications (ORCHA) assessment tool, we explored whether subjective user-level indicators of quality (user ratings and downloads) correlate with objective quality scores in the domains of user experience, data privacy and professional/clinical assurance. For this purpose, we applied Spearman correlation and multiple linear regression models. Results For user experience, professional/clinical assurance and data privacy scores, all models had very low adjusted R squared values (< .02), suggesting that there is no meaningful link between subjective user ratings or the number of health app downloads and objective quality measures. Spearman correlations suggested that prior downloads had only a very weak positive correlation with user experience scores (Spearman = .084, p = .012) and data privacy scores (Spearman = .088, p = .009). There was a very weak negative correlation between downloads and professional/clinical assurance scores (Spearman = -.081, p = .016). Additionally, user ratings were only very weakly correlated with the scores, and none of these correlations was statistically significant (all p > .05). For ORCHA scores, the multiple linear regression had an adjusted R squared of -.002. Conclusion This study highlights that widely available proxies which users may perceive to signify the quality of health apps, namely user ratings and downloads, are inaccurate predictors of quality. This indicates the need for wider use of quality assurance methodologies which can accurately determine the quality, safety, and compliance of health apps.
Findings suggest more should be done to enable users to recognise high-quality health apps, including digital health literacy training and the provision of nationally endorsed “libraries”.


Introduction
According to a report from 2021, there were more than 350,000 health apps available in the iOS and Android stores, with an estimated 250 health apps added every day [1]. Moreover, searches for digital health products within app stores have also increased [2]. A potential catalyst for this could have been the COVID-19 pandemic and restricted access to incumbent services. Nevertheless, these findings clearly indicate that the public has an interest in health apps.
However, given the large number of health apps on offer, it can be difficult for users to identify high-quality apps that meet their needs. Notably, selecting a low-quality app can be associated with substantial opportunity costs and/or risks. For example, a systematic assessment of suicide prevention and deliberate self-harm mobile health apps found that some apps encouraged risky behaviours such as the uptake of drugs [3]. Moreover, reviews across different disease areas have shown that many health apps do not comply with data privacy, sharing, and security standards [4][5][6][7], have safety concerns [8], provide incomplete or misleading medical information [9,10], lack evidence-based components [11], and/or have not been supported by efficacy/effectiveness studies [5,6,12]. Also, health experts have largely avoided formally recommending apps, which forces users to obtain recommendations from other sources [13]. Therefore, if users are not sufficiently informed, their app choices can result in poor health benefits when ineffective apps are chosen and/or pose significant risks to their health and privacy.
Notably, in the absence of guidance, users are likely to select health apps based on metrics that they perceive to be proxies for quality, such as prior purchases/downloads and user ratings. For instance, a study from 2020 [14] found that, besides price, in-app purchase options, and the presence of in-app advertisements, user ratings were impactful predictors of user downloads, and the number of downloads increased with average user ratings. However, while metrics such as user ratings may be useful when selecting many other goods and services, they may not accurately reflect the value and risks associated with the use of health apps [15], as these aspects are complex to assess and often not immediately apparent to (prior) users of the app.
In line with this, previous studies have shown that app quality ratings are often not significantly positively associated with user ratings. For instance, user ratings were found not to be significantly correlated with Mobile Application Rating Scale (MARS) scores [16,17] or 'PsyberGuide credibility ratings scale' (PGCRS) scores [18]. A study from 2022 [19] of apps for women with anxiety during pregnancy found a weak but significant negative correlation between its criteria-based scores and user ratings.
These findings suggest that user ratings and downloads are not a good proxy for overall app quality. However, most frameworks are not all-encompassing [20][21][22][23]; for example, the MARS does not include privacy questions. Hence, from the previous findings, it is unclear whether user ratings and download rates may be associated with compliance with individual quality components, such as user experience (UX), professional/clinical assurance (PCA) and data privacy (DP). The current study aimed to examine this relationship.
Specifically, this study's objective is to analyse the relationship of health app quality scores (UX, PCA and DP) with user ratings and the number of downloads of the corresponding health apps. This study has one hypothesis: user ratings and number of downloads are inadequate predictors of the user experience, professional/clinical assurance, and data privacy of health apps.

The dataset provenance
The dataset used for this study was provided by the Organisation for the Review of Care and Health Applications (ORCHA). ORCHA is a United Kingdom (UK) based digital health compliance company that specialises in the assessment of health apps. ORCHA provides an 'ORCHA library' that contains information about health apps that have been assessed regarding professional/clinical assurance, data privacy and user experience, allowing consumers and clinical professionals to make informed decisions whether to use or recommend these health apps. ORCHA is currently working with 70% of the National Health Service (NHS) organisations within England.
ORCHA provided a dataset comprising 2127 health app assessments, which were carried out using the ORCHA Baseline Review tool, Version 6 (OBR V6) [24]. For this study, 881 Android health apps were used; the steps involved in the inclusion of health apps can be found in Fig 1 of S1 Appendix. The OBR V6 tool is the latest version of the 'ORCHA assessment tool', which consists of ~300 objective assessment questions (most of which are dichotomous yes/no questions). OBR V6 assesses three aspects of a health app, namely 1) professional/clinical assurance (PCA), 2) data privacy (DP), and 3) user experience (UX) (also referred to as 'usability and accessibility'). Each of these three domains is scored individually on a scale from 0 to 100, and the three domain scores are combined into an overall ORCHA score.
The dataset consists of the aggregated user ratings, number of downloads and quality scores (UX, PCA and DP scores) for each health app. Each assessment of the 881 health apps was carried out by at least two trained reviewers; in the case of a dispute, a third reviewer resolved it. All reviewers underwent the same training to use the OBR V6 assessment tool. The dataset included health app assessments that were published between 18th January 2021 and 6th January 2022.

Statistical analysis and modelling
Data were accessed and analysed between July and December 2022. We carried out secondary data analyses of this ORCHA dataset using RStudio and the R programming language. Spearman correlations were used to examine how strongly ORCHA, UX, PCA and DP scores are correlated with user ratings (on a 1-5 scale) and the number of downloads. The number of downloads variable was converted into download levels, as only download ranges, not exact numbers of downloads, were available. There were 20 ranges of downloads, and each was assigned a download level going from 1 (the smallest) to 20 (the highest). For the analysis, the smallest value in each of the 20 ranges was also used as an alternative to the download levels. This was done to improve the rigour of the analysis by using two approaches to estimate the number of downloads from the available download ranges.
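The two download encodings and the rank-based correlation described above can be illustrated with a minimal sketch. It is written in Python using only the standard library rather than in R, and it is not the code used in the study. The range boundaries, the example apps and all function names are hypothetical and serve only to show the mechanics.

```python
from statistics import mean

# Hypothetical lower bounds of app-store download ranges. The study used 20
# ranges; only a few are listed here for illustration.
RANGE_LOWER_BOUNDS = [0, 10, 100, 1_000, 10_000, 100_000]


def download_level(lower_bound):
    """Map a range's lower bound to an ordinal level (1 = smallest range)."""
    return RANGE_LOWER_BOUNDS.index(lower_bound) + 1


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example apps: (lower bound of download range, UX score out of 100).
apps = [(10, 62.0), (1_000, 70.5), (100, 55.0), (100_000, 71.0), (1_000, 48.0)]
levels = [download_level(lb) for lb, _ in apps]
lower_bounds = [lb for lb, _ in apps]
ux_scores = [ux for _, ux in apps]

rho_levels = spearman(levels, ux_scores)
rho_bounds = spearman(lower_bounds, ux_scores)
```

Because Spearman's coefficient depends only on ranks, and the mapping from lower bounds to levels preserves order, both encodings yield the same coefficient in this sketch. The choice of encoding can only change results in models that use the numeric values themselves, such as linear regression.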
Multiple linear regression (MLR) was used to model the relationship between app quality scores and the apps' user ratings and downloads. R squared and adjusted R squared metrics were used to measure the fit of the models. For all statistical tests, a p-value < .013 (Bonferroni-corrected for multiple hypothesis testing) was considered statistically significant. If there are any links between user ratings or downloads and the quality scores, they should be revealed by Spearman correlations and/or MLR.
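As an illustration of the modelling step, the following Python sketch fits an ordinary least squares model with two predictors and computes R squared and adjusted R squared. It is a standard-library sketch under stated assumptions, not the R code used in the study: the data are made up, the function names are hypothetical, and the Bonferroni divisor of four (one test per score: ORCHA, UX, PCA and DP) is an assumption that happens to reproduce the reported threshold of about .013.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_mlr(predictors, response):
    """Least squares with an intercept; returns (coefs, R2, adjusted R2)."""
    rows = [[1.0] + list(p) for p in predictors]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    y_mean = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - y_mean) ** 2 for y in response)
    r2 = 1 - ss_res / ss_tot
    n, p = len(response), k - 1  # p = number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return coefs, r2, adj_r2


# Bonferroni: divide the family-wise alpha by the number of tests.
# Assumption: four tests, which gives .0125, i.e. about .013.
alpha = 0.05 / 4

# Made-up data: (user rating, download level) -> UX score.
X = [(4.5, 6), (3.2, 4), (4.8, 9), (2.9, 3), (4.1, 7), (3.7, 5)]
y = [61.0, 70.0, 58.0, 66.0, 72.0, 55.0]
coefs, r2, adj_r2 = fit_mlr(X, y)
```

Adjusted R squared penalises R squared for the number of predictors, so it is never larger than R squared and can fall below zero when R squared is close to zero.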

Ethical approval
This secondary data analytics study was approved by Ulster University (ethics filter committee for Faculty of Computing, Engineering and the Built Environment). The process undertaken by ORCHA ensures that health app developers are aware of their score and are given time to contest findings of the assessment which may be amended if developers provide additional relevant information. All reviews, unless explicitly asked to be removed by the developer, are covered as suitable for research in ORCHA's privacy policy.

Results
There was a total of 881 Android health apps used for this study. The categories of health apps and sample size (n) used in this study are depicted in Table 1 in descending order of sample size. Each health app has been assigned to one or multiple categories.
Table 5 shows the results of MLR, predicting each of the assessment scores (separately) with user ratings and download levels. Adjusted R squared was very small for all the scores; however, F-test p-values were statistically significant for UX (p = .005) and DP (p = .003) scores. To make the examination of the data more rigorous, the smallest value in each download range recorded by ORCHA (the ORCHA-recorded downloads with the plus sign removed) was also used for comparison.
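A very small R squared and a significant F-test are not contradictory when the sample is large, because the overall F statistic scales with the residual degrees of freedom. The worked example below uses the standard relationship between R squared and the overall F statistic. The value R squared = .012 is purely illustrative and is not taken from Table 5; n = 881 assumes that every app enters the model, and p = 2 is the number of predictors (user ratings and download levels).

```latex
F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}
  = \frac{0.012 / 2}{(1 - 0.012) / (881 - 2 - 1)}
  = \frac{0.006}{0.988 / 878}
  \approx 5.33
```

With 2 and 878 degrees of freedom, an F value of this size corresponds to a p-value of roughly .005, even though the model would explain only about 1% of the variance in the score.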

Principal findings
This study shows that user ratings and number of downloads are inadequate at predicting the quality of health apps. User ratings and download levels demonstrated weak correlations with all scores (ORCHA, UX, PCA and DP) and with each other, as shown in Table 4 (only the correlation between user ratings and downloads was statistically significant when using the Bonferroni-corrected alpha). Most scores showed a negative correlation with user ratings; UX was the only score that had a positive correlation, albeit weak and not significant. UX and DP scores were positively correlated with download levels, whilst ORCHA and PCA scores showed a negative correlation with download levels. The MLR models had low R squared values (< .02), as shown in Table 5, meaning that almost all of the variance in the scores is left unexplained by the models. This further indicates the inadequacy of user ratings and downloads at predicting scores (ORCHA, UX, PCA and DP).
Our findings indicate that user ratings and download levels are not accurate predictors of objective app quality. This suggests that users have difficulty determining, as a basis for their ratings and download decisions, key aspects that contribute to app quality and safety. A potential contributing factor to this may be a lack of digital health literacy. A study from 2021 described digital health literacy and internet connectivity as "super social determinants of health" [25], because they have implications for the wider social determinants of health. A study from 2017 found that individuals who were younger, had more education, reported excellent health, and had a higher income were the main users of health apps [26].
Moreover, our findings are in line with a study from 2022, which provided evidence of a gap between the user ratings and expert ratings from a curated library of over 1,200 apps that covered physical and mental health [27]. Our results suggest that the cause of this gap may be that health experts look for evidence of clinical quality, utility, privacy, and security that is not considered by users when they rate apps on the iOS and Android app stores. Moreover, users who get their health information from the internet rely on search engine results that may come from unaccredited sources [28]. This indicates that a trusted, objective way to judge the quality of health apps is needed.
This study highlights the need for quality assurance methodologies/tools to accurately determine the quality, safety and compliance of health apps. Our results are in line with the hypothesis that "user ratings and number of downloads are inadequate predictors of user experience, professional/clinical assurance, and data privacy of health apps". The lack of correlation observed between quality assessment scores and the user ratings and downloads of health apps suggests that many users use harmful and unsafe health apps, which may partly be due to poor digital health literacy. These issues need to be addressed as departments of health, for example the Food and Drug Administration of the United States [29] or Health and Social Care Northern Ireland [30], are moving towards embracing digital health technologies such as health apps.

Limitations
This study was limited to Android health apps only; therefore, inclusion of iOS apps, while not expected to be systematically different, may have yielded different findings. User ratings and the number of downloads of the health apps included in this study could have changed by the time this study is published. Additionally, as with any study in digital health, these technologies are highly flexible and subject to change, with updates occurring on a regular basis. Therefore, it is entirely possible that the objective compliance of the apps, the number of downloads or the user ratings may have changed since the study began, stressing the need for follow-up studies.
OBR is performed by humans, and therefore it is entirely possible, although unlikely, that errors can occur in the objective assessment of health apps. The sample sizes across user rating ranges (from 8 to 608) and download levels (from 0 to 177) varied widely. Only the range of downloads, as shown in Table 2, was available for analysis; the exact number of downloads for each health app was unavailable for this study. This means that precise download counts could not be used, leading to overestimation of download figures for some apps and underestimation for others, a natural side effect of transforming continuous data into categorical variables.

Conclusion
This study shows that online user app ratings and the number of app downloads are inadequate predictors of the quality of the health apps in terms of their user experience, professional/clinical assurance, and data privacy. This indicates the need for quality assurance methodologies/tools to accurately determine the quality, safety and compliance of health apps. It also suggests that the success and uptake of a health app is not based on its quality, which is a worrying prospect given the need for high quality health apps and given the need for digital health literacy amongst citizens. It is important that users self-select high quality health apps as opposed to being misled by user ratings and the popularity of an app.

Figs 1 and 2
Figs 1 and 2 depict how the median scores vary with user ratings and download levels. The independent scores UX, PCA and DP are represented with green, blue and purple lines respectively, and the dependent ORCHA score is depicted with a red line. The download levels 1, 2, 3 and 19 are not included since their sample size was 0. Fig 1 in S2 Appendix depicts boxplots for each score per user rating interval: ≥ 1 and < 2, ≥ 2 and < 3, ≥ 3 and < 4, ≥ 4 and ≤ 5. Figs 2-5 in S2 Appendix depict each score per download level. The sample size is shown above each boxplot.